Abstract

Context sensitive rewrite rules have been widely used in several areas of natural language 
processing, including syntax, morphology, phonology and speech processing. Kaplan and Kay, 
Karttunen, and Mohri & Sproat have given various algorithms to compile such rewrite rules 
into finite-state transducers. The present paper extends this work by allowing a limited form 
of backreferencing in such rules. The explicit use of backreferencing leads to more elegant 
and general solutions. 

1 Introduction 

Context sensitive rewrite rules have been widely used in several areas of natural language
processing. Johnson (1972) has shown that such rewrite rules are equivalent to finite state
transducers in the special case that they are not allowed to rewrite their own output. An algorithm
for compilation into transducers was provided by Kaplan & Kay (1994). Improvements and extensions
to this algorithm have been provided by Karttunen (1995, 1996, 1997) and Mohri & Sproat (1996).
In this paper, the algorithm will be extended to provide a limited form of backreferencing.
Backreferencing has been implicit in previous research, such as in the "batch rules" of
Kaplan & Kay (1994), bracketing transducers for finite-state parsing (Karttunen, 1996), and the
"LocalExtension" operation of Roche & Schabes (1995). The explicit use of backreferencing leads
to more elegant and general solutions.

Backreferencing is widely used in editors, scripting languages and other tools employing regular
expressions (Friedl, 1997). For example, Emacs uses the special brackets \( and \) to capture
strings along with the notation \n to recall the nth such string. The expression \(a*\)b\1 matches
strings of the form $a^n b a^n$. Unrestricted use of backreferencing thus can introduce non-regular
languages. For NLP finite state calculi this is unacceptable. The form of backreferences introduced
in this paper will therefore be restricted.
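The non-regular behaviour of an unrestricted backreference is easy to observe in any
backtracking regex engine. The following is a minimal sketch in Python (not part of the paper's
toolkit); the helper name `is_an_b_an` is introduced here purely for illustration.

```python
import re

# Emacs-style \(a*\)b\1 written in Python syntax: group 1 captures a run
# of a's, and \1 demands an identical copy of that run after the b.
PATTERN = re.compile(r"(a*)b\1")

def is_an_b_an(s: str) -> bool:
    """Return True iff s has the form a^n b a^n for some n >= 0."""
    return PATTERN.fullmatch(s) is not None

if __name__ == "__main__":
    for word in ["b", "aabaa", "aaba", "abaa"]:
        print(word, is_an_b_an(word))
```

Since no finite automaton can count the a's on both sides of the b, this language lies outside
what a finite state calculus can express, which motivates the restricted form used in the paper.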

The central case of an allowable backreference is:

$$x \Rightarrow T(x) / \lambda\,\_\_\,\rho \qquad (1)$$

This says that each string $x$ preceded by $\lambda$ and followed by $\rho$ is replaced by $T(x)$,
where $\lambda$ and $\rho$ are arbitrary regular expressions, and $T$ is a transducer.^1 This
contrasts sharply with the rewriting rules that follow the tradition of Kaplan & Kay:

$$\phi \Rightarrow \psi / \lambda\,\_\_\,\rho \qquad (2)$$

In this case, any string from the language $\phi$ is replaced by any string independently chosen
from the language $\psi$.

^1 The syntax at this point is merely suggestive. As an example, suppose that $T_{acr}$ transduces
phrases into acronyms. Then

$$x \Rightarrow T_{acr}(x) / \texttt{<abbr>}\,\_\_\,\texttt{</abbr>}$$

would transduce <abbr>non-deterministic finite automaton</abbr> into <abbr>NDFA</abbr>.
To compare this with a backreference in Perl, suppose that T_acr is a subroutine that converts
phrases into acronyms and that R_acr is a regular expression matching phrases that can be converted
into acronyms. Then (ignoring the left context) one can write something like:
s/(R_acr)(?=(<\/ABBR>))/T_acr($1)/ge;. The backreference variable, $1, will be set to whatever
string R_acr matches.
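For readers more at home in Python than Perl, the acronym rule can be imitated with `re.sub` and
a callback. This is only a sketch under stated assumptions: `t_acr` is a hypothetical stand-in for
the transducer $T_{acr}$ (it simply takes the first letter of each word), and the lookbehind and
lookahead play the roles of the left and right contexts.

```python
import re

def t_acr(phrase: str) -> str:
    """Hypothetical stand-in for T_acr: first letter of each word, upper-cased."""
    words = re.findall(r"[A-Za-z]+", phrase)
    return "".join(word[0].upper() for word in words)

def abbreviate(text: str) -> str:
    """Rewrite x as t_acr(x) wherever x stands between <abbr> and </abbr>."""
    return re.sub(
        r"(?<=<abbr>)([^<]+)(?=</abbr>)",   # left context, x, right context
        lambda match: t_acr(match.group(1)),  # the output depends on the captured x
        text,
    )

if __name__ == "__main__":
    print(abbreviate("<abbr>non-deterministic finite automaton</abbr>"))
```

The point of the comparison is that the replacement is computed from the captured string, which
is exactly what a rule of form (2) cannot express.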

We also allow multiple (non-permuting) backreferences of the form:

$$x_1 x_2 \ldots x_n \Rightarrow T_1(x_1)\,T_2(x_2) \ldots T_n(x_n) / \lambda\,\_\_\,\rho \qquad (3)$$

Since transducers are closed under concatenation, handling multiple backreferences reduces to the
problem of handling a single backreference:

$$x \Rightarrow (T_1 \cdot T_2 \cdot \ldots \cdot T_n)(x) / \lambda\,\_\_\,\rho \qquad (4)$$

A problem arises if we want capturing to follow the POSIX standard requiring a longest-capture
strategy. Friedl (1997, p. 117), for example, discusses matching the regular expression
(to|top)(o|polo)?(gical|o?logical) against the word topological. The desired result is that (once
an overall match is established) the first set of parentheses should capture the longest string
possible (top); the second set should then match the longest string possible from what's left (o),
and so on. Such a left-most longest match concatenation operation is described in §3.
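To see why the capture strategy needs explicit attention, it helps to run the same expression
through an engine that does not follow the POSIX rule. The sketch below uses Python's `re`
module, which tries alternatives in the order written rather than preferring the longest one;
the variable name `captures` is introduced here for illustration only.

```python
import re

# Python's backtracking engine commits to the first alternative that leads
# to an overall match, so group 1 settles for "to" instead of "top".
match = re.fullmatch(r"(to|top)(o|polo)?(gical|o?logical)", "topological")
captures = match.groups()

if __name__ == "__main__":
    print(captures)  # ('to', 'polo', 'gical'), not the POSIX ('top', 'o', 'logical')
```

Both splits cover the whole word, so the overall match is the same; only the internal boundaries
differ, and those boundaries are what a backreference exposes.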

In the following section, we initially concentrate on the simple case in (1) and show how (1)
may be compiled assuming left-to-right processing along with the overall longest match strategy
described by Karttunen (1996).

The major components of the algorithm are not new, but straightforward modifications of
components presented in Karttunen (1996) and Mohri & Sproat (1996). We improve upon existing
approaches because we solve a problem concerning the use of special marker symbols (§2.1.2). A
further contribution is that all steps are implemented in a freely available system, the FSA
Utilities of van Noord (1997) (§2.1.1).



2 The Algorithm 



2.1 Preliminary Considerations 

Before presenting the algorithm proper, we will deal with a couple of meta issues. First, we
introduce our version of the finite state calculus in §2.1.1. The treatment of special marker
symbols is discussed in §2.1.2. Then in §2.1.3, we discuss various utilities that will be
essential for the algorithm.



2.1.1 FSA Utilities 

The algorithm is implemented in the FSA Utilities (van Noord, 1997). We use the notation provided
by the toolbox throughout this paper. Table 1 lists the relevant regular expression operators. FSA
Utilities offers the possibility to define new regular expression operators. For example, consider
the definition of the nullary operator vowel as the union of the five vowels:

macro(vowel, {a,e,i,o,u}).

In such macro definitions, Prolog variables can be used in order to define new n-ary regular
expression operators in terms of existing operators. For instance, the lenient_composition operator
is defined by:

macro(priority_union(Q,R),
      {Q, ~domain(Q) o R}).
macro(lenient_composition(R,C),
      priority_union(R o C,R)).

[]              empty string
[E1,...En]      concatenation of E1 ... En
{}              empty language
{E1,...En}      union of E1, ... En
E*              Kleene closure
E^              optionality
~E              complement
E1-E2           difference
$ E             containment
E1 & E2         intersection
?               any symbol
A:B             pair
E1 x E2         cross-product
A o B           composition
domain(E)       domain of a transduction
range(E)        range of a transduction
identity(E)     identity transduction
inverse(E)      inverse transduction

Table 1: Regular expression operators.



Here, priority_union of two regular expressions Q and R is defined as the union of Q and the 
composition of the complement of the domain of Q with R. Lenient composition of R and C is 
defined as the priority union of the composition of R and C (on the one hand) and R (on the other
hand).
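The two definitions can be checked on small finite relations. The sketch below is an
illustration only: it models a transduction as a Python set of (input, output) pairs, which is an
assumption made here for brevity and not how FSA Utilities represents transducers.

```python
from typing import Set, Tuple

Relation = Set[Tuple[str, str]]

def domain(rel: Relation) -> Set[str]:
    return {x for x, _ in rel}

def compose(r: Relation, s: Relation) -> Relation:
    """R o S: feed every output of R into S."""
    return {(x, z) for x, y in r for y2, z in s if y == y2}

def priority_union(q: Relation, r: Relation) -> Relation:
    """{Q, ~domain(Q) o R}: all of Q, plus R restricted to inputs Q does not cover."""
    covered = domain(q)
    return q | {(x, y) for x, y in r if x not in covered}

def lenient_composition(r: Relation, c: Relation) -> Relation:
    """Apply constraint C where some output of R survives it, otherwise keep R."""
    return priority_union(compose(r, c), r)

if __name__ == "__main__":
    r = {("a", "x"), ("a", "y"), ("b", "z")}
    c = {("x", "x")}  # a recognizer for {x}, coerced to an identity relation
    print(sorted(lenient_composition(r, c)))
```

In the example, input a keeps only the output that satisfies the constraint, while input b, none
of whose outputs satisfy it, falls back to its unconstrained output.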

Some operators, however, require something more than simple macro expansion for their
definition. For example, suppose a user wanted to match n occurrences of some pattern. The FSA
Utilities already has the '*' and '+' quantifiers, but any other operators like this need to be
user defined. For this purpose, the FSA Utilities supplies simple Prolog hooks allowing this
general quantifier to be defined as:

macro(match_n(N,X),Regex) :-
    match_n(N,X,Regex).

match_n(0,_X,[]).
match_n(N,X,[X|Rest]) :-
    N > 0,
    N1 is N-1,
    match_n(N1,X,Rest).

For example: match_n(3,a) is equivalent to the ordinary finite state calculus expression [a, a, a] . 
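The recursion in the Prolog hook simply unrolls the quantifier into an explicit concatenation.
A Python mirror of the same expansion, given here only as an illustration (Python lists stand in
for the calculus' concatenation brackets), looks as follows.

```python
from typing import List

def match_n(n: int, x: str) -> List[str]:
    """Expand 'n occurrences of x' into an explicit concatenation list."""
    if n < 0:
        raise ValueError("n must be non-negative")
    if n == 0:
        return []                      # match_n(0,_X,[]).
    return [x] + match_n(n - 1, x)     # match_n(N,X,[X|Rest]) :- ...

if __name__ == "__main__":
    print(match_n(3, "a"))
```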

Finally, regular expression operators can be defined in terms of operations on the underlying
automaton. In such cases, Prolog hooks for manipulating states and transitions may be used. This
functionality has been used in van Noord & Gerdemann (1999) to provide an implementation of the
algorithm in Mohri & Sproat (1996).

2.1.2 Treatment of Markers 

Previous algorithms for compiling rewrite rules into transducers have followed Kaplan & Kay (1994)
by introducing special marker symbols (markers) into strings in order to mark off candidate regions
for replacement. The assumption is that these markers are outside the resulting transducer's
alphabets. But previous algorithms have not ensured that the assumption holds.

This problem was recognized by Karttunen (1996), whose algorithm starts with a filter transducer
which filters out any string containing a marker. This is problematic for two reasons. First, when
applied to a string that does happen to contain a marker, the algorithm will simply fail. Second,
it leads to logical problems in the interpretation of complementation. Since the complement of a
regular expression R is defined as $\Sigma - R$, one needs to know whether the marker symbols are
in $\Sigma$ or not. This has not been clearly addressed in previous literature.

We have taken a different approach by providing a contextual way of distinguishing markers
from non-markers. Every symbol used in the algorithm is replaced by a pair of symbols, where
the second member of the pair is a 1 if the first member is a marker and a 0 otherwise.^2 As the
first step in the algorithm, 0's are inserted after every symbol in the input string to indicate
that initially every symbol is a non-marker. This is defined as:

macro(non_markers, [?, []:0]*).

Similarly, the following macro can be used to insert a 0 after every symbol in an arbitrary
expression E.

macro(non_markers(E),
      range(E o non_markers)).

Since E is a recognizer, it is first coerced to identity(E). This form of implicit conversion is
standard in the finite state calculus.

Note that 0 and 1 are perfectly ordinary alphabet symbols, which may also be used within a
replacement. For example, the sequence [1,0] represents a non-marker use of the symbol 1.
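The pairing scheme can be pictured on plain symbol lists. The following sketch is an assumption
made for illustration (lists of Python values instead of transducer paths); the helper names
`marker` and `strip_flags` are not from the paper.

```python
from typing import List

def non_markers(symbols: List[object]) -> List[object]:
    """Mirror of [?, []:0]*: follow every input symbol with the flag 0."""
    tagged: List[object] = []
    for symbol in symbols:
        tagged += [symbol, 0]
    return tagged

def marker(name: str) -> List[object]:
    """A marker is a symbol followed by the flag 1, e.g. ['<1', 1]."""
    return [name, 1]

def strip_flags(tagged: List[object]) -> List[object]:
    """Mirror of inverse(non_markers): drop the 0 flags; reject leftover markers."""
    if len(tagged) % 2:
        raise ValueError("every symbol must be followed by a flag")
    plain = []
    for symbol, flag in zip(tagged[0::2], tagged[1::2]):
        if flag != 0:
            raise ValueError("marker still present")
        plain.append(symbol)
    return plain

if __name__ == "__main__":
    text = ["a", "<1", 1]            # the input itself uses '<1' and 1 as symbols
    print(non_markers(text))          # ['a', 0, '<1', 0, 1, 0]
    print(marker("<1"))               # ['<1', 1], distinguishable from ['<1', 0]
```

Because the flag, not the symbol, decides what counts as a marker, an input that happens to
contain a bracket symbol is handled like any other text.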

2.1.3 Utilities 

Before describing the algorithm, it will be helpful to have at our disposal a few general tools,
most of which were described already in Kaplan & Kay (1994). These tools, however, have been
modified so that they work with our approach of distinguishing markers from ordinary symbols. So to
begin with, we provide macros to describe the alphabet and the alphabet extended with marker
symbols:

macro(sig, [?,0]).
macro(xsig, [?,{0,1}]).

The macro xsig is useful for defining a specialized version of complementation and containment:

macro(not(X), xsig* - X).
macro($$(X), [xsig*, X, xsig*]).

The algorithm uses four kinds of brackets, so it will be convenient to define macros for each of
these brackets, and for a few disjunctions.

macro(lb1, ['<1',1]).
macro(lb2, ['<2',1]).
macro(rb2, ['2>',1]).
macro(rb1, ['1>',1]).
macro(lb, {lb1,lb2}).
macro(rb, {rb1,rb2}).
macro(b1, {lb1,rb1}).
macro(b2, {lb2,rb2}).
macro(brack, {lb,rb}).

^2 This approach is similar to the idea of laying down tracks as in the compilation of monadic
second-order logic into automata (Klarlund, 1997, p. 5). In fact, this technique could possibly be
used for a more efficient implementation of our algorithm; instead of adding transitions over 0 and
1, one could represent the alphabet as bit sequences and then add a final 0 bit for any ordinary
symbol and a final 1 bit for a marker symbol.

As in Kaplan & Kay, we define an Intro(S) operator that produces a transducer that freely
introduces instances of S into an input string. We extend this idea to create a family of Intro
operators. It is often the case that we want to freely introduce marker symbols into a string at
any position except the beginning or the end.

%% Free introduction
macro(intro(S), {xsig-S, [] x S}*).

%% Introduction, except at begin
macro(xintro(S), {[], [xsig-S, intro(S)]}).

%% Introduction, except at end
macro(introx(S), {[], [intro(S), xsig-S]}).

%% Introduction, except at begin & end
macro(xintrox(S), {[], [xsig-S],
                   [xsig-S, intro(S), xsig-S]}).

This family of Intro operators is useful for defining a family of Ignore operators:

macro(ign(E1,S), range(E1 o intro(S))).
macro(xign(E1,S), range(E1 o xintro(S))).
macro(ignx(E1,S), range(E1 o introx(S))).
macro(xignx(E1,S), range(E1 o xintrox(S))).
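The four Ignore variants differ only in whether a marker may appear in first or last position.
A membership test makes that concrete. The sketch below is a simplification introduced here:
symbols are single list items rather than symbol/flag pairs, and `ignoring` is a hypothetical
helper name.

```python
from typing import Sequence

def ignoring(tagged: Sequence[str], plain: Sequence[str], marker: str,
             at_begin: bool = True, at_end: bool = True) -> bool:
    """True iff `tagged` is `plain` with copies of `marker` freely inserted,
    optionally forbidding a marker in first or last position."""
    if marker in plain:
        raise ValueError("plain text must not contain the marker itself")
    if [s for s in tagged if s != marker] != list(plain):
        return False
    if not at_begin and tagged and tagged[0] == marker:
        return False
    if not at_end and tagged and tagged[-1] == marker:
        return False
    return True

def ign(t, w, m):   return ignoring(t, w, m)
def xign(t, w, m):  return ignoring(t, w, m, at_begin=False)
def ignx(t, w, m):  return ignoring(t, w, m, at_end=False)
def xignx(t, w, m): return ignoring(t, w, m, at_begin=False, at_end=False)

if __name__ == "__main__":
    m = "<2"
    print(ign([m, "a", m, "b"], ["a", "b"], m))    # marker allowed anywhere
    print(xign([m, "a", m, "b"], ["a", "b"], m))   # rejected: marker at the beginning
```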

In order to create filter transducers to ensure that markers are placed in the correct positions,
Kaplan & Kay introduce the operator P-iff-S(L1,L2). A string is described by this expression
iff each prefix in L1 is followed by a suffix in L2 and each suffix in L2 is preceded by a prefix
in L1. In our approach, this is defined as:

macro(if_p_then_s(L1,L2),
      not([L1,not(L2)])).
macro(if_s_then_p(L1,L2),
      not([not(L1),L2])).
macro(p_iff_s(L1,L2),
      if_p_then_s(L1,L2)
      &
      if_s_then_p(L1,L2)).

To make the use of p_iff_s more convenient, we introduce a new operator l_iff_r(L,R), which
describes strings where every string position is preceded by a string in L just in case it is
followed by a string in R:

macro(l_iff_r(L,R),
      p_iff_s([xsig*,L], [R,xsig*])).
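Read position by position, p_iff_s says that at every split point of the string the prefix is in
L1 exactly when the suffix is in L2. The sketch below checks that condition directly on Python
strings; using regular expressions from the `re` module for L and R, and single characters
instead of symbol/flag pairs, are simplifications assumed here.

```python
import re
from typing import Callable

def p_iff_s(in_l1: Callable[[str], bool], in_l2: Callable[[str], bool], s: str) -> bool:
    """At every split s = prefix + suffix: prefix in L1 iff suffix in L2."""
    return all(in_l1(s[:i]) == in_l2(s[i:]) for i in range(len(s) + 1))

def l_iff_r(left: str, right: str, s: str) -> bool:
    """Every position is preceded by a match of `left` iff followed by a match of `right`."""
    left_re = re.compile(f"(?:{left})\\Z")    # prefix ends in a string of L
    right_re = re.compile(f"(?:{right})")      # suffix begins with a string of R
    return p_iff_s(lambda p: left_re.search(p) is not None,
                   lambda q: right_re.match(q) is not None,
                   s)

if __name__ == "__main__":
    # '<' plays the role of a bracket that must sit exactly in front of every 'ab'.
    print(l_iff_r("<", "ab", "x<aby<ab"))
    print(l_iff_r("<", "ab", "xab"))
```

The second call fails because an occurrence of ab is not preceded by the bracket, which is the
kind of misplaced or missing marker these filters are meant to exclude.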

Finally, we introduce a new operator if (Condition, Then, Else) for conditionals. This oper- 
ator is extremely useful, but in order for it to work within the finite state calculus, one needs a 

convention as to what counts as a boolean true or false for the condition argument. It is possible 
to define true as the universal language and false as the empty language: 

macro(true, ? *).
macro(false, {}).

macro(replace(T,Left,Right),
   non_markers                 % introduce 0 after every symbol
     o                         % (a b c => a 0 b 0 c 0).
   r(Right)                    % introduce rb2 before any string
     o                         % in Right.
   f(domain(T))                % introduce lb2 before any string in
     o                         % domain(T) followed by rb2.
   left_to_right(domain(T))    % lb2 ... rb2 around domain(T) optionally
     o                         % replaced by lb1 ... rb1
   longest_match(domain(T))    % filter out non-longest matches marked
     o                         % in previous step.
   aux_replace(T)              % perform T's transduction on regions marked
     o                         % off by b1's.
   l1(Left)                    % ensure that lb1 must be preceded
     o                         % by a string in Left.
   l2(Left)                    % ensure that lb2 must not occur preceded
     o                         % by a string in Left.
   inverse(non_markers)).      % remove the auxiliary 0's.

Figure 1: Definition of replace operator.

With these definitions, we can use the complement operator as negation, the intersection
operator as conjunction and the union operator as disjunction. Arbitrary expressions may be
coerced to booleans using the following macro:

macro(coerce_to_boolean(E),
      range(E o (true x true))).

Here, E should describe a recognizer. E is composed with the universal transducer, which transduces
from anything (?*) to anything (?*). Now with this background, we can define the conditional:

macro(if(Cond,Then,Else),
      { coerce_to_boolean(Cond) o Then,
        ~coerce_to_boolean(Cond) o Else
      }).
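The coercion and the conditional can be tried out on finite languages. The sketch below rests
on two assumptions made purely for illustration: the universal language ?* is replaced by a small
finite set `UNIVERSE`, and Then and Else are recognizers, so that composition reduces to
intersection.

```python
from typing import FrozenSet, Iterable

# Hypothetical finite stand-in for the universal language ?*.
UNIVERSE: FrozenSet[str] = frozenset({"", "a", "b", "ab"})

def coerce_to_boolean(language: Iterable[str]) -> FrozenSet[str]:
    """Non-empty language -> true (everything); empty language -> false (nothing)."""
    return UNIVERSE if set(language) else frozenset()

def cond(condition: Iterable[str], then: Iterable[str], otherwise: Iterable[str]) -> FrozenSet[str]:
    """{ coerce(Cond) o Then, ~coerce(Cond) o Else } for recognizers."""
    truth = coerce_to_boolean(condition)
    return (truth & frozenset(then)) | ((UNIVERSE - truth) & frozenset(otherwise))

if __name__ == "__main__":
    right = {"", "a"}
    # Condition in the style of "[] & R": is the empty string in the language?
    print(sorted(cond({""} & right, {"ab"}, {"b"})))
```

Exactly one of the two branches survives, because the coerced condition is either everything or
nothing.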

2.2 Implementation 

A rule of the form $x \Rightarrow T(x) / \lambda\,\_\_\,\rho$ will be written as
replace(T,Lambda,Rho). Rules of the more general form
$x_1 \ldots x_n \Rightarrow T_1(x_1) \ldots T_n(x_n) / \lambda\,\_\_\,\rho$ will be discussed in
§3. The algorithm consists of nine steps composed as in figure 1.

The names of these steps are mostly derived from Karttunen (1996) and Mohri & Sproat (1996), even
though the transductions involved are not exactly the same. In particular, the steps derived from
Mohri & Sproat (r, f, l1 and l2) will all be defined in terms of the finite state calculus as
opposed to Mohri & Sproat's approach of using low-level manipulation of states and transitions.^3

The first step, non_markers, was already defined above. For the second step, we first consider
a simple special case. If the empty string is in the language described by Right, then r(Right)
should insert an rb2 in every string position. The definition of r(Right) is both simpler and more
efficient if this is treated as a special case. To insert a bracket in every possible string
position, we use:

[[[] x rb2,sig]*,[] x rb2]

If the empty string is not in Right, then we must use intro(rb2) to introduce the marker rb2,
followed by l_iff_r to ensure that such markers are immediately followed by a string in Right,
or more precisely a string in Right where additional instances of rb2 are freely inserted in any
position other than the beginning. This expression is written as:

intro(rb2)
  o
l_iff_r(rb2,xign(non_markers(Right),rb2))

Putting these two pieces together with the conditional yields:

macro(r(R),
   if([] & R,                       % If: [] is in R:
      [[[] x rb2,sig]*,[] x rb2],
      intro(rb2)                    % Else:
        o
      l_iff_r(rb2,xign(non_markers(R),rb2)))).

^3 The alternative implementation is provided in van Noord & Gerdemann (1999).

The third step, f(domain(T)), is implemented as:

macro(f(Phi), intro(lb2)
                o
              l_iff_r(lb2,[xignx(non_markers(Phi),b2),
                           lb2^,rb2])).

The lb2 is first introduced and then, using l_iff_r, it is constrained to occur immediately
before every instance of (ignoring complexities) Phi followed by an rb2. Phi needs to be marked
as normal text using non_markers and then xignx is used to allow freely inserted lb2 and rb2
anywhere except at the beginning and end. The following lb2^ allows an optional lb2, which
occurs when the empty string is in Phi.
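At the level of plain strings, steps two and three amount to computing two sets of positions:
where a right context starts (rb2) and where a match of Phi starts that ends at such a position
(lb2). The sketch below computes those positions with Python's `re` module. It is an illustration
under stated assumptions, not the transducer construction: Phi and Right are given as Python
regular expressions, and `mark_candidates` is a name introduced here.

```python
import re
from typing import List, Tuple

def mark_candidates(phi: str, right: str, s: str) -> Tuple[List[int], List[int]]:
    """Return (lb2 positions, rb2 positions) for the string s."""
    phi_re, right_re = re.compile(phi), re.compile(right)
    n = len(s)
    # rb2: positions immediately followed by a string in Right.
    rb2 = [j for j in range(n + 1) if right_re.match(s, j)]
    # lb2: positions immediately followed by a string in Phi that ends at an rb2.
    lb2 = [i for i in range(n + 1)
           if any(phi_re.fullmatch(s, i, j) for j in rb2 if j >= i)]
    return lb2, rb2

if __name__ == "__main__":
    print(mark_candidates("ab*", "c", "abbcab"))
```

With an empty right context every position receives an rb2, which corresponds to the special
case treated separately in the definition of r.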

The fourth step is a guessing component which (ignoring complexities) looks for sequences
of the form lb2 Phi rb2 and converts some of these into lb1 Phi rb1, where the b1 marking
indicates that the sequence is a candidate for replacement. The complication is that Phi, as always,
must be converted to non_markers(Phi) and instances of b2 need to be ignored. Furthermore,
between pairs of lb1 and rb1, instances of lb2 are deleted. These lb2 markers have done their
job and are no longer needed. Putting this all together, the definition is:

macro(left_to_right(Phi),
  [[xsig*,
    [lb2 x lb1,
     (ign(non_markers(Phi),b2)
        o
      inverse(intro(lb2))
     ),
     rb2 x rb1]
   ]*, xsig*]).

The fifth step filters out non-longest matches produced in the previous step. For example (and
simplifying a bit), if Phi is ab*, then a string of the form ... lb1 a b rb1 b ... should be ruled
out since there is an instance of Phi (ignoring brackets except at the end) where there is an
internal rb1. This is implemented as:^4

macro(longest_match(Phi),
  not($$([lb1,
          (ignx(non_markers(Phi),brack)
             &
           $$(rb1)
          ),           % longer match must be
          rb           % followed by an rb
         ]))           % so context is ok
    o
  % done with rb2, throw away:
  inverse(intro(rb2))).

^4 The line with $$(rb1) can be optimized a bit: Since we know that an rb1 must be preceded by Phi,
we can write: [ign_(non_markers(Phi),brack),rb1,xsig*]). This may lead to a more constrained (hence
smaller) transducer.

The sixth step performs the transduction described by T. This step is straightforwardly
implemented, where the main difficulty is getting T to apply to our specially marked string:

macro(aux_replace(T),
  {{sig,lb2},
   [lb1,
    inverse(non_markers)
      o T o
    non_markers,
    rb1 x []
   ]
  }*).

The seventh step ensures that lb1 is preceded by a string in Left:

macro(l1(L),
  ign(if_s_then_p(
        ignx([xsig*,non_markers(L)],lb1),
        [lb1,xsig*]),
      lb2)
    o
  inverse(intro(lb1))).

The eighth step ensures that lb2 is not preceded by a string in Left. This is implemented
similarly to the previous step:

macro(l2(L),
  if_s_then_p(
    ignx(not([xsig*,non_markers(L)]),lb2),
    [lb2,xsig*])
    o
  inverse(intro(lb2))).

Finally the ninth step, inverse(non_markers), removes the 0's so that the final result is not
marked up in any special way.
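The net effect of the cascade on a single input string can be approximated procedurally: scan
from left to right, at each position take the longest match of domain(T) that is followed by the
right context, and hand the matched string to T. The sketch below does exactly that and nothing
more. It is not the compiled transducer, and it rests on several assumptions introduced here:
T is a Python function, Phi and Right are Python regular expressions, matches are non-empty, and
the left context is left out for brevity.

```python
import re
from typing import Callable

def replace(t: Callable[[str], str], phi: str, right: str, s: str) -> str:
    """Left-to-right, longest-match rewriting of x as t(x) before `right`."""
    phi_re, right_re = re.compile(phi), re.compile(right)
    out, i, n = [], 0, len(s)
    while i < n:
        # Longest j > i such that s[i:j] is in Phi and Right matches at j.
        end = next((j for j in range(n, i, -1)
                    if phi_re.fullmatch(s, i, j) and right_re.match(s, j)),
                   None)
        if end is None:
            out.append(s[i])          # no candidate starts here: copy one symbol
            i += 1
        else:
            out.append(t(s[i:end]))   # the output depends on the matched string
            i = end
    return "".join(out)

if __name__ == "__main__":
    print(replace(str.upper, "ab*", "c", "abbcab"))
    print(replace(lambda x: "[" + x + "]", "ab*", "", "abbab"))
```

In the first call only the occurrence followed by c is rewritten; in the second, the longest-match
preference yields [abb][ab] rather than any shorter bracketing.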

3 Longest Match Capturing 

As discussed in §1, the POSIX standard requires that multiple captures follow a longest match
strategy. For multiple captures as in (3), one establishes first a longest match for
$domain(T_1) \cdot \ldots \cdot domain(T_n)$. Then we ensure that each of $domain(T_i)$ in turn is
required to match as long as possible, with each one having priority over its rightward neighbors.
To implement this, we define a macro lm_concat(Ts) and use it as:

replace(lm_concat(Ts),Left,Right)

Ensuring the longest overall match is delegated to the replace macro, so lm_concat(Ts) needs
only ensure that each individual transducer within Ts gets its proper left-to-right longest matching
priority. This problem is mostly solved by the same techniques used to ensure the longest match
within the replace macro. The only complication here is that Ts can be of unbounded length. So
it is not possible to have a single expression in the finite state calculus that applies to all
possible lengths. This means that we need something a little more powerful than mere macro expansion
to construct the proper finite state calculus expression. The FSA Utilities provides a Prolog hook
for this purpose. The resulting definition of lm_concat is given in figure 2.

macro(lm_concat(Ts),mark_boundaries(Domains) o ConcatTs):-
   domains(Ts,Domains), concatT(Ts,ConcatTs).

domains([],[]).
domains([F|R0],[domain(F)|R]):- domains(R0,R).

concatT([],[]).
concatT([T|Ts],[inverse(non_markers) o T,lb1 x []|Rest]):- concatT(Ts,Rest).

%% macro(mark_boundaries(L),Exp): This is the central component of lm_concat. For our
%% "topological" example we will have:
%% mark_boundaries([domain([{[t,o],[t,o,p]},[]: #]),
%%                  domain([{o,[p,o,l,o]},[]: #]),
%%                  domain({[g,i,c,a,l],[o^,l,o,g,i,c,a,l]})])
%% which simplifies to:
%% mark_boundaries([{[t,o],[t,o,p]}, {o,[p,o,l,o]}, {[g,i,c,a,l],[o^,l,o,g,i,c,a,l]}]).
%% Then by macro expansion, we get:
%% [{[t,o],[t,o,p]} o non_markers,[] x lb1,
%%  {o,[p,o,l,o]} o non_markers,[] x lb1,
%%  {[g,i,c,a,l],[o^,l,o,g,i,c,a,l]} o non_markers,[] x lb1]
%% o
%% % Filter 1: {[t,o],[t,o,p]} gets longest match
%% ~ [ignx_1(non_markers({[t,o],[t,o,p]}),lb1),
%%    ign(non_markers({o,[p,o,l,o]}),lb1),
%%    ign(non_markers({[g,i,c,a,l],[o^,l,o,g,i,c,a,l]}),lb1)]
%% o
%% % Filter 2: {o,[p,o,l,o]} gets longest match
%% ~ [non_markers({[t,o],[t,o,p]}),lb1,
%%    ignx_1(non_markers({o,[p,o,l,o]}),lb1),
%%    ign(non_markers({[g,i,c,a,l],[o^,l,o,g,i,c,a,l]}),lb1)]

macro(mark_boundaries(L),Exp):-
   boundaries(L,Exp0),     % guess boundary positions
   greed(L,Exp0,Exp).      % filter non-longest matches

boundaries([],[]).
boundaries([F|R0],[F o non_markers, [] x lb1 |R]):- boundaries(R0,R).

greed(L,Composed0,Composed) :-
   aux_greed(L,[],Filters), compose_list(Filters,Composed0,Composed).

aux_greed([H|T],Front,Filters):- aux_greed(T,H,Front,Filters,_CurrentFilter).

aux_greed([],F,_,[],[ign(non_markers(F),lb1)]).
aux_greed([H|R0],F,Front,[~L1|R],[ign(non_markers(F),lb1)|R1]) :-
   append(Front,[ignx_1(non_markers(F),lb1)|R1],L1),
   append(Front,[non_markers(F),lb1],NewFront),
   aux_greed(R0,H,NewFront,R,R1).

%% ignore at least one instance of E2 except at end
macro(ignx_1(E1,E2), range(E1 o [[? *,[] x E2]+,? +])).

compose_list([],SoFar,SoFar).
compose_list([F|R],SoFar,Composed):- compose_list(R,(SoFar o F),Composed).

Figure 2: Definition of lm_concat operator.

Suppose (as in Friedl (1997)), we want to match the following list of recognizers against the
string topological and insert a marker in each boundary position. This reduces to applying:

lm_concat([
     [{[t,o],[t,o,p]},[]:'#'],
     [{o,[p,o,l,o]},[]:'#'],
     {[g,i,c,a,l],[o^,l,o,g,i,c,a,l]}
  ])

This expression transduces the string topological only to the string top#o#logical.^5
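The priority scheme can be reproduced on strings with a small recursive search: each part tries
its longest alternative first and gives way only if the remaining parts cannot finish the match.
The sketch below is an illustration under assumptions made here, not the Prolog hook itself: each
part is a finite list of alternatives plus the text it emits after its match, and the whole input
must be consumed.

```python
from typing import List, Optional, Tuple

Part = Tuple[List[str], str]  # (alternatives to match, text emitted after the match)

def lm_concat(parts: List[Part], s: str) -> Optional[str]:
    """Match s against the parts in order, earlier parts taking the longest piece."""
    if not parts:
        return "" if s == "" else None
    alternatives, suffix = parts[0]
    for candidate in sorted(set(alternatives), key=len, reverse=True):
        if s.startswith(candidate):
            rest = lm_concat(parts[1:], s[len(candidate):])
            if rest is not None:       # the remaining parts can still succeed
                return candidate + suffix + rest
    return None

PARTS: List[Part] = [
    (["to", "top"], "#"),
    (["o", "polo"], "#"),
    (["gical", "logical", "ological"], ""),   # {gical, o^logical} spelled out
]

if __name__ == "__main__":
    print(lm_concat(PARTS, "topological"))
```

The first part claims top even though to would also lead to an overall match, after which the
second part can only take o.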



4 Conclusions 

The algorithm presented here has extended previous algorithms for rewrite rules by adding a 
limited version of backreferencing. This allows the output of rewriting to be dependent on the form 
of the strings which are rewritten. This new feature brings techniques used in Perl-like languages 
into the finite state calculus. Such an integration is needed in practical applications where simple 
text processing needs to be combined with more sophisticated computational linguistics techniques. 

One particularly interesting example where backreferences are essential is cascaded deterministic
(longest match) finite state parsing as described for example in Abney (1996) and various papers
in Roche & Schabes (1997). Clearly, the standard rewrite rules do not apply in this domain. If NP
is an NP recognizer, it would not do to say $NP \Rightarrow [NP] / \lambda\,\_\_\,\rho$. Nothing
would force the string matched by the NP to the left of the arrow to be the same as the string
matched by the NP to the right of the arrow.

One advantage of using our algorithm for finite state parsing is that the left and right contexts
may be used to bring in top-down filtering.^6 An often cited advantage of finite state parsing is
robustness. A constituent is found bottom up in an early level in the cascade even if that
constituent does not ultimately contribute to an S in a later level of the cascade. While this is
undoubtedly an advantage for certain applications, our approach would allow the introduction of some
top-down filtering while maintaining the robustness of a bottom-up approach. A second advantage
for robust finite state parsing is that bracketing could also include the notion of "repair" as in
Abney (1990). One might, for example, want to say something like:
$xy \Rightarrow [_{np}\ \mathit{RepairDet}(x)\ \mathit{RepairN}(y)\,] / \lambda\,\_\_\,\rho$ ^7
so that an NP could be parsed as a slightly malformed Det followed by a slightly malformed N.
RepairDet and RepairN, in this example, could be doing a variety of things such as: contextualized
spelling correction, reordering of function words, replacement of phrases by acronyms, or any
other operation implemented as a transducer.

Finally, we should mention the problem of complexity. A critical reader might see the nine
steps in our algorithm and conclude that the algorithm is overly complex. This would be a false
conclusion. To begin with, the problem itself is complex. It is easy to create examples where the
resulting transducer created by any algorithm would become unmanageably large. But there exist
strategies for keeping the transducers smaller. For example, it is not necessary for all nine steps
to be composed. They can also be cascaded. In that case it will be possible to implement different
steps by different strategies, e.g. by deterministic or non-deterministic transducers or bimachines
(Roche & Schabes, 1997, Introduction). The range of possibilities leaves plenty of room for future
research.

^5 An anonymous reviewer suggested that lm_concat could be implemented in the framework of
Karttunen (1996) as:

[t o | t o p | o | p o l o] -> ... #;

Indeed the resulting transducer from this expression would transduce topological into
top#o#logical. But unfortunately this transducer would also transduce polotopogical into
polo#top#o#gical, since the notion of left-right ordering is lost in this expression.

^6 The bracketing operator of Karttunen (1996), on the other hand, does not provide for left and
right contexts.

^7 The syntax here has been simplified. The rule should be understood as:
replace(lm_concat([[]:'[np', repair_det, repair_n, []:']']), lambda, rho).

References 

Steve Abney. Rapid incremental parsing with repair. In Proceedings of the 6th New OED
Conference: Electronic Text Research, pages 1-9, 1990.

Steven Abney. Partial parsing via finite-state cascades. In Proceedings of the ESSLLI '96
Robust Parsing Workshop, 1996.

Jeffrey Friedl. Mastering Regular Expressions. O'Reilly & Associates, Inc., 1997.

C. Douglas Johnson. Formal Aspects of Phonological Descriptions. Mouton, The Hague, 1972.

Ronald Kaplan and Martin Kay. Regular models of phonological rule systems. Computational
Linguistics, 20(3):331-379, 1994.

L. Karttunen, J-P. Chanod, G. Grefenstette, and A. Schiller. Regular expressions for language
engineering. Natural Language Engineering, 2(4):305-328, 1996.

Lauri Karttunen. The replace operator. In 33rd Annual Meeting of the Association for
Computational Linguistics, M.I.T. Cambridge Mass., 1995.

Lauri Karttunen. Directed replacement. In 34th Annual Meeting of the Association for
Computational Linguistics, Santa Cruz, 1996.

Lauri Karttunen. The replace operator. In Emmanuel Roche and Yves Schabes, editors,
Finite-State Language Processing, pages 117-147. Bradford, MIT Press, 1997.

Lauri Karttunen. The proper treatment of optimality theory in computational phonology. In
Finite-state Methods in Natural Language Processing, pages 1-12, Ankara, June 1998.

Nils Klarlund. Mona & Fido: The logic automaton connection in practice. In CSL '97, 1997.

Mehryar Mohri and Richard Sproat. An efficient compiler for weighted rewrite rules. In 34th
Annual Meeting of the Association for Computational Linguistics, Santa Cruz, 1996.

Emmanuel Roche and Yves Schabes. Deterministic part-of-speech tagging with finite-state
transducers. Computational Linguistics, 21:227-263, 1995. Reprinted in Roche & Schabes
(1997).

Emmanuel Roche and Yves Schabes, editors. Finite-State Language Processing. MIT Press,
Cambridge, 1997.

Emmanuel Roche and Yves Schabes. Introduction. In Emmanuel Roche and Yves Schabes,
editors, Finite-State Language Processing. MIT Press, Cambridge, Mass, 1997.

Gertjan van Noord. FSA Utilities, 1997. The FSA Utilities toolbox is available free of charge
under Gnu General Public License at http://www.let.rug.nl/~vannoord/Fsa/

Gertjan van Noord and Dale Gerdemann. An extendible regular expression compiler for finite-
state approaches in natural language processing. In Workshop on Implementing Automata
99, Potsdam Germany, 1999.



